Abstract

Complete Exchange requires each of $N$ processors to send a unique
message to each of the remaining $N - 1$ processors. For a
circuit-switched hypercube with $N = 2^d$ processors, the Direct and
Standard algorithms for Complete Exchange are optimal for very large and
very small message sizes, respectively. For intermediate sizes, a hybrid
Multiphase algorithm is better. This carries out Direct exchanges on
a set of subcubes whose dimensions are a partition of the integer $d$.
The best such algorithm for a given message size $m$ could hitherto
only be found by enumerating all partitions of $d$.

The Multiphase algorithm is analyzed assuming a high performance
communication network. It is proved that only algorithms corresponding
to equipartitions of $d$ (partitions in which the maximum and minimum
elements differ by at most 1) can possibly be optimal. The run times of
these algorithms plotted against $m$ form a hull of optimality. It is
proved that, although there is an exponential number of partitions,
(1) the number of faces on this hull is $O(\sqrt{d})$, (2) the hull can
be found in $O(\sqrt{d})$ time, and (3) once it has been found, the
optimal algorithm for any given $m$ can be found in $O(\log d)$ time.

These results provide a very fast technique for minimizing
communication overhead in many important applications, such as matrix
transpose, Fast Fourier transform and ADI.
Institute for Computer Applications in Science & Engineering, Mail Stop 132C, NASA
Langley Research Center, Hampton, VA 23681-0001.
1  Introduction


On a distributed memory parallel computer, the complete exchange or
all-to-all personalized communication pattern requires each of $N$
processors to send a unique $m$-byte message to each of the remaining
$N - 1$ processors. This pattern arises in many important algorithms,
such as matrix transpose, vector-matrix multiply and Fast Fourier
transforms. It is also of importance in its own right since it is the
densest communication requirement that can be imposed on an
interconnection network. The time required to carry out the complete
exchange is, thus, a useful measure of the power of a parallel computer
system. Finally, in many applications that require a dense communication
pattern that is a subset of the complete exchange, it is usually
beneficial to use a highly tuned complete exchange routine rather than
attempting to write specific code for the required communication.

On circuit-switched hypercubes, such as the Intel iPSC-860 and the
nCUBE-2, there are two basic algorithms for obtaining the complete
exchange. For a hypercube with $N = 2^d$ processors, the Standard
exchange algorithm attempts to minimize the impact of the startup time
of a message by combining several messages into one ‘super’ message and
using only $d = \log N$ message transmissions [11]. After each
transmission, a shuffle step serves to route messages towards their
correct destinations. This algorithm suffers from substantial data
permutation overhead.

The Direct algorithm uses $N - 1$ carefully scheduled ‘direct’
transmissions, relying on knowledge of the routing algorithm used by the
hardware to avoid message contention [14, 16, 17]. This algorithm has no
data permutation overhead but suffers from $N - 1$ message startups. It
is demonstrable that the Standard exchange algorithm is best for very
small message sizes, while the Direct algorithm requires minimum time
for very large messages [3].

Multiphase  complete  exchange  is  a  hybrid  algorithm  that  combines  the 
features  of  the  Standard  exchange  and  Direct  algorithms.  It  carries  out  the 
complete  exchange  as  a  series  of  ‘partial’  exchanges  on  a  set  of  subcubes[2, 
4,  9,  10].  It  permits  a  compromise  between  the  message  transmission  and 
permutation  overhead  of  Standard  exchange  and  the  message  startups  of  the 
Direct  algorithm. 

The multiphase algorithm has been implemented and shown to be useful on
the iPSC-2 and iPSC-860 hypercubes. For a given hypercube dimension $d$,
the number of possible multiphase algorithms equals the number of
partitions of the integer $d$. This is an exponential (though slowly
growing) number and hitherto the only way to find the best multiphase
algorithm for a given message size was to enumerate all these partitions.

In  this  paper  we  carry  out  a  detailed  analysis  of  the  hull  of  optimality 
of  all  such  multiphase  algorithms.  We  make  the  assumption  that  the  time 
to  transmit  a  message  from  one  processor  to  another  is  independent  of  the 
number  of  communication  links  traversed.  This  assumption  is  valid  for  most 
high-performance  circuit-switched  machines. 

Our analysis reveals that only algorithms corresponding to equipartitions
of $d$ (partitions in which the largest and smallest elements differ by
at most 1) can ever be optimal. Furthermore, the number of potentially
optimal algorithms is always between $2\sqrt{d} - 1$ and $3\sqrt{d}$. We
show that the hull of optimality can be found in $O(\sqrt{d})$ time. Once
the hull has been obtained, the optimal algorithm for a specific value of
message size $m$ can be found in $O(\log d)$ time.

This result provides a very fast method of finding the optimal algorithm
for a given message size and thus helps in reducing the communication
overhead in a variety of important parallel applications. The
$O(\log d)$ time for finding the optimal algorithm is so small that it
may well be feasible to choose the algorithm during the course of program
execution, based on the dimension of the hypercube and the size of the
message currently being transmitted.
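As a rough illustration of the run-time lookup, assume the hull has already been computed and stored as a sorted list of breakpoints (the message sizes at which the optimal partition changes). A binary search then selects the face; since the hull has $O(\sqrt{d})$ faces, this costs $O(\log d)$ comparisons. The names `FACES`, `BREAKPOINTS` and `choose_partition` are hypothetical, and the numbers are the breakpoints that follow from the run-time formulas of Section 3 for $d = 4$ with the sample parameters $\lambda = 100$, $\delta = 10$, $\tau = 2$, $\rho = 1$.

```python
from bisect import bisect_right

# Hull for d = 4 under the sample parameters (an assumption of this sketch):
# faces ordered from small m to large m, and the m values where they change.
FACES = [(1, 1, 1, 1), (2, 2), (4,)]
BREAKPOINTS = [440 / 96, 15840 / 544]  # roughly 4.58 and 29.12 bytes


def choose_partition(m, breakpoints=BREAKPOINTS, faces=FACES):
    """Return the partition whose face of the hull covers message size m."""
    return faces[bisect_right(breakpoints, m)]
```

For example, `choose_partition(10)` selects the two-phase partition `(2, 2)`.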

In Section 2 of this paper we discuss the complete exchange communication
pattern and present the three algorithms. Section 3 contains our main
analysis in which we present our notation, properties of equipartitions,
main theorems, and obtain bounds on the number of faces on the hull. We
conclude with a discussion of the ramifications of our results and
suggestions for future research directions.


2  The  Complete  Exchange 

Complete Exchange requires each of $N$ processors of a parallel machine
to send a different message to each of the remaining $N - 1$ processors.
This pattern arises, for example, when transposing a matrix of
$N \times N$ blocks that has been distributed over $N$ processors, with
one column per processor. The transpose requires each processor to send a
different block to each of the remaining processors. The resulting
communication pattern is equivalent to the complete directed graph of $N$
nodes.

The matrix mapping described above is required when using the
Alternating Directions Implicit (ADI) method for solving partial
differential equations [6, 13]. This method requires access to the
matrix by rows and columns in
successive  phases,  necessitating  heavy  use  of  a  transpose.  Matrix-matrix  and 
matrix-vector  multiplies  have  similar  requirements.  Complete  exchanges  are 
also  required  in  many  implementations  of  the  parallel  FFT. 

The  complete  exchange,  being  equivalent  to  the  complete  directed  graph, 
is  the  densest  communication  requirement  that  can  be  imposed  on  a  network. 
The  time  required  by  the  complete  exchange  is  an  upper  bound  on  the  time 
required  by  any  other  pattern  and  thus  provides  a  useful  measure  of  the 
power  of  a  distributed  memory  parallel  system. 

2.1  Standard  Exchange 

The Standard exchange algorithm was presented by Johnsson & Ho [11] and
uses $\log N$ transmissions of size $N/2$ blocks each. All communications
are over single links, therefore no attention needs to be paid to the
routing algorithm (in effect, the algorithm does the routing itself). The
overheads in this algorithm are due to shuffling and the long message
sizes that need to be transmitted. Despite this, the algorithm is
competitive for small block sizes, since the total number of messages it
transmits is $\log N$ as opposed to $N - 1$ for the Direct algorithm.

procedure Standard-Exchange;
begin
    for j = d - 1 downto 0 do
    begin
        if (bit j of mynumber = 0) then
            message = blocks n/2 to n - 1
        else
            message = blocks 0 to n/2 - 1;
        send_message_to_processor(mynumber ⊕ 2^j);
        shuffle blocks;
    end;
end;




2.2  Direct  Algorithm 

The Direct algorithm was first reported (in Japanese) by Take [17] and
later by Seidel et al. [14, 16]. In this algorithm each processor sends
out $N - 1$ messages, one to each of the remaining processors. The issue
is to schedule the transmissions such that no edge contention takes
place. Assuming the almost universal ‘e-cube’ routing algorithm, the
exclusive-OR schedule described below achieves contention-free
transmission. This algorithm always outperforms Standard Exchange for
large message sizes.

procedure Direct;
begin
    for i = 1 to n - 1 do
        send_block_to_processor(mynumber ⊕ i);
end;
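The contention-free claim for the exclusive-OR schedule can be checked by brute force on small cubes. The sketch below assumes one common form of e-cube routing, in which differing address bits are corrected from the lowest to the highest; the function names are hypothetical and the bit order of any particular machine may differ.

```python
def ecube_path(src, dst, d):
    """Directed links used from src to dst, correcting bits low to high."""
    edges, node = [], src
    for bit in range(d):
        if ((src ^ dst) >> bit) & 1:
            nxt = node ^ (1 << bit)
            edges.append((node, nxt))
            node = nxt
    return edges


def direct_schedule_is_contention_free(d):
    """True if no directed link is used twice within any step of Direct."""
    n = 1 << d
    for i in range(1, n):          # step i: every node p sends to p XOR i
        used = set()
        for p in range(n):
            for edge in ecube_path(p, p ^ i, d):
                if edge in used:
                    return False
                used.add(edge)
    return True
```

Within one step every message flips the same bits in the same order, so each directed link can belong to the path of only one source, which is what the check confirms.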


2.3  Multiphase  Complete  Exchange 

The multiphase algorithm combines the Standard exchange and the Direct
algorithms into one unified algorithm. It carries out the complete
exchange as a sequence of two or more ‘partial’ exchanges. This algorithm
has been implemented on the iPSC-2 and iPSC-860 [2, 4, 9, 10]. A complete
exchange on a hypercube of dimension $d$ with $n = 2^d$ processors and
block size $m$ is done using a set of partial exchanges
$\mathcal{D} = \{d_1, d_2, \ldots, d_k\}$ on $k$ subcubes, where each
$d_i$ specifies the dimension of the $i$-th subcube. Obviously
$|\mathcal{D}| = k$, $1 \le k \le d$, and $\sum_{i=1}^{k} d_i = d$. Each
partial exchange is called a phase.

The $j$th partial exchange is done on the set of subcubes determined by
bits $\sum_{i=1}^{j} d_i - d_j$ to $\sum_{i=1}^{j} d_i$ of the hypercube
node labels. In the partial exchange for the $i$th phase, $2^{d-d_i}$
blocks of $m$ bytes each are transmitted to each of $2^{d_i} - 1$
processors. The effective block size is thus $m 2^{d-d_i}$.

procedure Multiphase;

{ d:     dimension of the hypercube
  n:     number of phases (subcubes) in partition D
  d_i:   dimension of the ith subcube in partition D
  start: starting bit of subcube label
  stop:  ending bit of subcube label }
begin
    start = d - 1;
    for i = 1 to n do
    {Partial exchange}
    begin
        stop = start - d_i + 1;
        compute effective blocksize;
        for j = 1 to (2^(start-stop+1) - 1) do
            send_effective_block_to_processor(mynumber ⊕ (j * 2^stop));
        shuffle blocks d_i times;
        start = stop - 1;
    end;
end;

In the above algorithm, when $k = d$, all $d_i$'s are 1 (here $k$ is the
number of phases, written $n$ in the pseudocode). In this case the outer
$i$ loop is executed $k$ times with start = stop = $d - 1, d - 2, \ldots,
1, 0$. The inner $j$ loop is executed only once for each $i$. In this
case Multiphase degenerates into Standard exchange. When $k = 1$ and
therefore $d_1 = d$, the outer loop is executed only once, stop always
equals 0 and, in the inner loop, $j$ takes on the values
$1, 2, \ldots, 2^d - 1$ and thus Multiphase becomes Direct.
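The two degenerate cases can be seen by listing the partners a node contacts in each phase. This is a minimal sketch of the loop structure of the Multiphase procedure only; it ignores block packing and shuffling, and `multiphase_partners` is a hypothetical name.

```python
def multiphase_partners(d, partition, me):
    """Partners contacted by node `me`, one list per phase of the partition."""
    assert sum(partition) == d, "partition must sum to the cube dimension"
    phases, start = [], d - 1
    for d_i in partition:
        stop = start - d_i + 1
        # j ranges over 1 .. 2^d_i - 1, shifted into bits stop..start
        phases.append([me ^ (j << stop) for j in range(1, 1 << d_i)])
        start = stop - 1
    return phases
```

With `partition = [1] * d` each phase has a single partner across one link (Standard exchange); with `partition = [d]` the single phase contacts `me ^ j` for every `j` (Direct).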

In  our  analysis,  we  have  assumed  that  the  complete  exchange  corresponds 
exactly  to  a  transpose.  Thus  not  only  do  blocks  have  to  be  transmitted  among 
processors  but  each  block  needs  to  be  placed  in  memory  in  the  destination 
processor  in  its  ‘correct’  transposed  position.  This  accounts  for  the  shuffle 
at  the  end  of  the  last  partial  exchange.  When  there  is  only  one  phase,  i.e. 
the  algorithm  corresponds  to  Direct  exchange,  the  last  set  of  d  shuffles  is 
equivalent  to  the  identity  permutation  and  is  redundant.  In  the  interest  of 
simplicity,  this  has  not  been  excluded  from  our  analysis. 


2.4  Implementation 

A detailed evaluation of the performance of the Standard Exchange and
Direct algorithms appears in [3]. The multiphase algorithm has been
evaluated in [2, 4], wherein it has been shown that this approach can
improve performance by as much as a factor of 2.
3  Analysis of the Multiphase Algorithm

The performance parameters characterizing a typical hypercube
architecture are given in Table 1. $\tau$ is the time to transmit one
byte while $\rho$ is the time to move a byte from one memory location to
another on the same processor. $\lambda$ is the startup time, the time
that elapses from issuance of a transmit request to initiation of
transmission of the first byte. $\delta$ is the distance impact, that is,
the time required for a message to travel across the communication
network of the processor. We assume this to be independent of the number
of communication links traversed.

We  omit  the  overhead  of  processor  synchronization  from  our  analysis. 
Each  phase  of  our  algorithm  takes  a  precise  amount  of  time.  If  all  processors 
keep  their  clocks  synchronized,  there  is  no  need  for  a  global  synchronization 
operation between phases, as the time to start a new phase can be
computed by each processor independently. The issue of clock synchronization
on  hypercubes  is  discussed  in  [7]. 


Table 1: Performance parameters of a hypercube

  Symbol      Description         Units
  $\tau$      transmission        time per byte
  $\rho$      data permutation    time per byte
  $\lambda$   startup (latency)   time per message
  $\delta$    distance impact     time per message

The time taken to transmit a message of $m$ bytes is
$\tau m + \lambda + \delta$. The Standard Exchange algorithm requires $d$
transmissions of $m 2^{d-1}$ bytes each, and $d$ shuffles on $2^d$ blocks
of $m$ bytes. This leads to

$$t_{\mathrm{standard}} = d(\tau m 2^{d-1} + \lambda + \delta) + d \rho 2^d m
  = d\left[\left(\frac{\tau}{2} + \rho\right) 2^d m + \lambda + \delta\right].$$

The Direct algorithm needs $2^d - 1$ transmissions of $m$ bytes each,
giving us

$$t_{\mathrm{direct}} = (2^d - 1)\left[\tau m + (\lambda + \delta)\right].$$
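A Multiphase algorithm with $n$ phases of dimensions $d_1, \ldots, d_n$
requires, for the $i$th phase, $2^{d_i} - 1$ transmissions of
$m 2^{d-d_i}$ bytes each, followed by a permutation of $2^d$ blocks of
$m$ bytes. Thus

$$t_{d,d_i} = (2^{d_i} - 1)(\lambda + \tau m 2^{d-d_i} + \delta) + \rho m 2^d
  = \left\{\left(1 - \frac{1}{2^{d_i}}\right)\tau + \rho\right\} 2^d m
    + (2^{d_i} - 1)(\lambda + \delta). \qquad (1)$$

Summing over the $n$ phases, the total time required by the Multiphase
algorithm is

$$t_{\mathrm{multiphase}} = \sum_{i=1}^{n} t_{d,d_i}
  = \sum_{i=1}^{n} \left[\left\{\left(1 - \frac{1}{2^{d_i}}\right)\tau + \rho\right\} 2^d m
    + (2^{d_i} - 1)(\lambda + \delta)\right]. \qquad (2)$$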
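The three cost expressions translate directly into code. The sketch below is an illustrative transcription, with parameter names (`tau`, `rho`, `lam`, `delta`) chosen here to mirror Table 1; it is not taken from any implementation cited in the paper.

```python
def t_standard(d, m, tau, rho, lam, delta):
    """Run time of Standard exchange on a dimension-d hypercube."""
    return d * ((tau / 2 + rho) * 2**d * m + lam + delta)


def t_direct(d, m, tau, rho, lam, delta):
    """Run time of the Direct algorithm."""
    return (2**d - 1) * (tau * m + lam + delta)


def t_phase(d, d_i, m, tau, rho, lam, delta):
    """Equation (1): one partial exchange on subcubes of dimension d_i."""
    return ((1 - 2.0**-d_i) * tau + rho) * 2**d * m + (2**d_i - 1) * (lam + delta)


def t_multiphase(d, partition, m, tau, rho, lam, delta):
    """Equation (2): sum of the phase times over the partition."""
    return sum(t_phase(d, d_i, m, tau, rho, lam, delta) for d_i in partition)
```

With all phases of dimension 1 the total equals `t_standard`. With a single phase of dimension `d` it exceeds `t_direct` by `rho * 2**d * m`, the final shuffle that the analysis keeps for simplicity.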

3.1  Finding  the  best  Multiphase  algorithm 

In our presentation of the multiphase algorithm, we have not stated which
of the many possible partitions of the integer $d$ is best in terms of
total time. The total number of partitions $p(d)$ of the integer $d$ is
approximated by [1, 8]

$$p(d) \sim \frac{1}{4 d \sqrt{3}}\, e^{\pi \sqrt{2d/3}},$$

which is a slowly growing exponential, with $p(20) = 627$. It is
feasible, though neither efficient nor elegant, to enumerate all
partitions of $d$ to find the best algorithm using the expression (2) for
$t_{\mathrm{multiphase}}$. Furthermore, $t_{\mathrm{multiphase}}$ is not
convex for $n = 2$. It is therefore not possible to find the best
partition by recursively halving $d$.
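For comparison with the results that follow, the brute-force search can be sketched as below. The generator, the asymptotic estimate and `best_partition` are illustrative helpers with hypothetical names; the cost is equation (2), and the default parameters are the sample values used for Figure 1.

```python
import math


def partitions(d, smallest=1):
    """Yield the partitions of d as non-decreasing tuples."""
    if d == 0:
        yield ()
        return
    for first in range(smallest, d + 1):
        for rest in partitions(d - first, first):
            yield (first,) + rest


def hardy_ramanujan(d):
    """Asymptotic estimate of the number of partitions of d."""
    return math.exp(math.pi * math.sqrt(2 * d / 3)) / (4 * d * math.sqrt(3))


def best_partition(d, m, tau=2, rho=1, lam=100, delta=10):
    """Exhaustive search over all partitions of d using equation (2)."""
    def cost(part):
        return sum(((1 - 2.0**-a) * tau + rho) * 2**d * m
                   + (2**a - 1) * (lam + delta) for a in part)
    return min(partitions(d), key=cost)
```

The search visits every partition, so its cost grows with $p(d)$; the asymptotic estimate is only approximate for small $d$.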

The objective of this paper is to carry out a detailed investigation of
the multiphase algorithm. We shall be concerned with the hull of
optimality formed by the straight lines that describe the run times of
all possible multiphase algorithms on a hypercube of dimension $d$
plotted against the message size $m$. We shall show that a large class of
algorithms can never be optimal. Of the remaining algorithms, a large
fraction are optimal only at vertices of the hull of optimality and can
be ignored. These results permit us to obtain a bound of $O(\sqrt{d})$ on
the number of optimal algorithms.
3.2  Notation


Let $[n_1, u_1][n_2, u_2] \cdots$ denote the sequence made up of $n_1$
$u_1$'s, followed by $n_2$ $u_2$'s, etc. Thus [3, 2][4, 3][2, 5] denotes
the sequence {222333355}. The elements of a sequence shall always be
enumerated in non-decreasing order.

Let the calligraphic letter $\mathcal{A}_{d,n}$ denote an arbitrary
partition of the integer $d$ with cardinality $n$. The elements of this
partition are denoted by the lowercase letters $a_i$. We shall omit
subscripts when they are irrelevant to the discussion. Example: two of
many possible cardinality 6 partitions of the integer 30 are {224679} and
{115788}. Table 2 shows the partitions of $d = 5$. Define an equipartition
of the integer $d$ to be a partition in which the largest and smallest
elements differ by at most 1. An equipartition of $d$ with cardinality
$n$ is denoted $\mathcal{E}_{d,n}$. By definition,
$\mathcal{E}_{d,1} = \{d\}$. In Table 2 the cardinality 3 equipartition
is {122}.

Table 2: Partitions of the integer 5.

  n   Partitions
  1   {5}
  2   {14}   {23}
  3   {113}  {122}
  4   {1112}
  5   {11111}

It is straightforward to verify that

$$\mathcal{E}_{d,n} = \left[n - d \bmod n,\ \left\lfloor \frac{d}{n} \right\rfloor\right]
  \left[d \bmod n,\ \left\lceil \frac{d}{n} \right\rceil\right]. \qquad (3)$$

For example $\mathcal{E}_{19,8}$ = {22222333} = [5, 2][3, 3]. Since the
cardinality $n$ equipartition of an integer $d$ is unique, there are $d$
unique equipartitions of the integer $d$.

The time taken by a set of partial exchanges corresponding to a partition
of the integer $e \le d$, $\mathcal{M}_{e,n} = \{m_1, m_2, \ldots, m_n\}$,
on a dimension $d$ hypercube is

$$t_{d,\mathcal{M}_{e,n}} = \sum_{i=1}^{n} t_{d,m_i}. \qquad (4)$$

In the case $e < d$, $\mathcal{M}_{e,n}$ is not a partition of $d$ and
the resultant data movement is not a complete exchange. Nevertheless this
definition is important for subsequent analysis. When $e = d$,
$\mathcal{M}_{e,n}$ is a partition of $d$, and the set of partial
exchanges corresponding to $\mathcal{M}_{e,n}$ constitutes a multiphase
algorithm for complete exchange, as described above. We shall use the
terms ‘algorithm’ and ‘partition’ interchangeably, so that when we say
‘time required by a partition’, we mean the ‘time required by a set of
partial exchanges corresponding to that partition’.

Of particular interest to the ensuing discussion is the time required by
an equipartition, which is obtained by combining (3) and (4):

$$t_{d,\mathcal{E}_{d,n}} = (n - d \bmod n)\, t_{d,\lfloor d/n \rfloor}
  + (d \bmod n)\, t_{d,\lceil d/n \rceil}. \qquad (5)$$
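The equipartition of cardinality $n$ is easy to construct from the quotient and remainder of $d$ divided by $n$. A minimal sketch, with a hypothetical function name:

```python
def equipartition(d, n):
    """Cardinality-n equipartition of d, as a non-decreasing list.

    n - (d mod n) elements equal floor(d/n); d mod n elements equal ceil(d/n).
    """
    q, r = divmod(d, n)
    return [q] * (n - r) + [q + 1] * r
```

When $n$ divides $d$ the remainder is zero and all $n$ elements equal $d/n$.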

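For a partition $\mathcal{A}_{a,n} = \{a_1, a_2, \ldots, a_n\}$, we have

$$\begin{aligned}
t_{d,\mathcal{A}_{a,n}} &= \sum_{i=1}^{n} t_{d,a_i} \\
 &= \left[(1 - 2^{-a_1})\tau + \rho\right] 2^d m + (2^{a_1} - 1)(\lambda + \delta) \\
 &\quad + \left[(1 - 2^{-a_2})\tau + \rho\right] 2^d m + (2^{a_2} - 1)(\lambda + \delta) + \cdots \\
 &\quad + \left[(1 - 2^{-a_n})\tau + \rho\right] 2^d m + (2^{a_n} - 1)(\lambda + \delta) \\
 &= \left[(n - 2^{-a_1} - 2^{-a_2} - \cdots - 2^{-a_n})\tau + n\rho\right] 2^d m \\
 &\quad + (2^{a_1} + 2^{a_2} + \cdots + 2^{a_n} - n)(\lambda + \delta).
\end{aligned}$$

This prompts us to define, for the partition $\mathcal{A}_{a,n}$,

$$2^{\mathcal{A}_{a,n}} = 2^{a_1} + 2^{a_2} + \cdots + 2^{a_n}$$

and

$$2^{-\mathcal{A}_{a,n}} = 2^{-a_1} + 2^{-a_2} + \cdots + 2^{-a_n},$$

which then leads to the compact expression

$$t_{d,\mathcal{A}_{a,n}} = \left[(n - 2^{-\mathcal{A}_{a,n}})\tau + n\rho\right] 2^d m
  + (2^{\mathcal{A}_{a,n}} - n)(\lambda + \delta). \qquad (6)$$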
Since every element of $\mathcal{A}$ is at least 1, the coefficient of
$m$ in the above expression is $> 0$, as is the coefficient of
$(\lambda + \delta)$. Thus when $t_{d,\mathcal{A}_{a,n}}$ is plotted
against $m$ we obtain a line with positive slope and intercept.

For an equipartition we have

$$t_{d,\mathcal{E}_{e,n}} = \left[(n - 2^{-\mathcal{E}_{e,n}})\tau + n\rho\right] 2^d m
  + (2^{\mathcal{E}_{e,n}} - n)(\lambda + \delta). \qquad (7)$$

3.3  Properties  of  Equipartitions 

Several properties of $2^{\mathcal{E}_{e,n}}$ and $2^{-\mathcal{E}_{e,n}}$
shall be useful in the ensuing discussion and are presented in this
Section. In understanding these properties, it is useful to refer to
Table 3, which lists $\mathcal{A}_{e,n}$, $2^{\mathcal{A}_{e,n}}$ and
$2^{-\mathcal{A}_{e,n}}$ for all partitions of $e = 7$. The last column
of this table indicates if an entry is an equipartition.


Table 3: Partitions of the integer e = 7.

  n   2^A    2^-A       A_{7,n}          Equipartition
  1   128    0.007812   7                E_{7,1}
  2    66    0.515625   1 6
  2    36    0.281250   2 5
  2    24    0.187500   3 4              E_{7,2}
  3    36    1.031250   1 1 5
  3    22    0.812500   1 2 4
  3    18    0.750000   1 3 3
  3    16    0.625000   2 2 3            E_{7,3}
  4    22    1.562500   1 1 1 4
  4    16    1.375000   1 1 2 3
  4    14    1.250000   1 2 2 2          E_{7,4}
  5    16    2.125000   1 1 1 1 3
  5    14    2.000000   1 1 1 2 2        E_{7,5}
  6    14    2.750000   1 1 1 1 1 2      E_{7,6}
  7    14    3.500000   1 1 1 1 1 1 1    E_{7,7}
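The two sums tabulated above, and the slope and intercept they give through (6), can be computed as follows. This is a small sketch using exact rational arithmetic; the function names are hypothetical.

```python
from fractions import Fraction


def pow_sum(partition):
    """The quantity written 2^A: sum of 2^a over the elements a."""
    return sum(2**a for a in partition)


def neg_pow_sum(partition):
    """The quantity written 2^-A: sum of 2^-a over the elements a."""
    return sum(Fraction(1, 2**a) for a in partition)


def line(d, partition, tau, rho, lam, delta):
    """(slope, intercept) of the run time against m, from equation (6)."""
    n = len(partition)
    slope = ((n - neg_pow_sum(partition)) * tau + n * rho) * 2**d
    intercept = (pow_sum(partition) - n) * (lam + delta)
    return slope, intercept
```

For instance, `pow_sum((3, 4))` and `float(neg_pow_sum((3, 4)))` give the row for the partition 3 4 in Table 3.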

The  first  three  properties  arise  from  the  theory  of  Schur-convexity[12] 
which  we  summarize  as  follows. 
1. Given $X, Y \in \mathbb{R}^n$, with $\sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i$.
Let $x_{[i]}, y_{[i]}$ be the $i$th largest component of $X, Y$,
respectively. We say $X \prec Y$ if
$\sum_{i=1}^{j} x_{[i]} \le \sum_{i=1}^{j} y_{[i]}$ for all
$j = 1, 2, \ldots, n$.

2. $\Phi : \mathbb{R}^n \to \mathbb{R}$ is called Schur-convex if,
whenever $X \prec Y$, then $\Phi(X) \le \Phi(Y)$.

3. If $g : \mathbb{R} \to \mathbb{R}$ is convex then
$\Phi(X) = \sum_{i=1}^{n} g(x_i)$ is Schur-convex. Examples of such
functions are $g_1(x) = 2^x$, $g_2(x) = 1/2^x$.

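Property 1 For any $1 \le n \le e$

(a) $2^{\mathcal{E}_{e,e}} \le 2^{\mathcal{A}_{e,n}}$

(b) $2^{-e} = 2^{-\mathcal{E}_{e,1}} \le 2^{-\mathcal{A}_{e,n}} \le \frac{n}{2}$.

Property 2

(a) $2^{-\mathcal{E}_{d,2}} \le 2^{-\mathcal{A}_{d,2}}$

(b) $2^{\mathcal{E}_{d,2}} \le 2^{\mathcal{A}_{d,2}}$.

Property 3 $2^{\mathcal{E}_{e,n}} \le 2^{\mathcal{E}_{e,n-1}}$.

Property 4 $2^{-\mathcal{E}_{e,n}} - 2^{-\mathcal{E}_{e,n-1}} \le 3/4$.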

Proof. $\mathcal{E}_{e,n-1}$ can always be obtained by deleting the
smallest element of $\mathcal{E}_{e,n}$ and distributing it over the
remaining elements of $\mathcal{E}_{e,n}$. Suppose

$$\mathcal{E}_{e,n} = \{e_1, e_2, \ldots, e_n\}$$

and that, for some $k < n$,

$$e_1 = e_{1,1} + e_{1,2} + \cdots + e_{1,k},$$

all of which are greater than zero. Then

$$2^{-\mathcal{E}_{e,n}} - 2^{-\mathcal{E}_{e,n-1}}
  = 2^{-e_{1,1} - e_{1,2} - \cdots - e_{1,k}}
  + 2^{-e_2}(1 - 2^{-e_{1,1}}) + 2^{-e_3}(1 - 2^{-e_{1,2}}) + \cdots
  + 2^{-e_{k+1}}(1 - 2^{-e_{1,k}}),$$

since the elements $e_{k+2}, \ldots, e_n$ are common to both partitions
and cancel. This is a positive quantity that achieves a maximum when
$k = 1$ and $e_{1,1} = 1$, in which case it is

$$2^{-1} + 2^{-1}(1 - 2^{-1}) = 3/4.$$
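Properties 3 and 4 can also be spot-checked numerically for small $e$. The sketch below is only a sanity check over a finite range, not a substitute for the proofs; the helper names are hypothetical.

```python
from fractions import Fraction


def equipartition(e, n):
    """Cardinality-n equipartition of e (equation (3))."""
    q, r = divmod(e, n)
    return [q] * (n - r) + [q + 1] * r


def check_properties_3_and_4(max_e):
    """Check Properties 3 and 4 for every 2 <= n <= e <= max_e."""
    for e in range(2, max_e + 1):
        for n in range(2, e + 1):
            fine, coarse = equipartition(e, n), equipartition(e, n - 1)
            prop3 = sum(2**x for x in fine) <= sum(2**x for x in coarse)
            gap = (sum(Fraction(1, 2**x) for x in fine)
                   - sum(Fraction(1, 2**x) for x in coarse))
            if not (prop3 and 0 < gap <= Fraction(3, 4)):
                return False
    return True
```

The bound of 3/4 is attained, for example, when passing from seven elements to six for $e = 7$ (3.5 versus 2.75 in Table 3).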


We have mentioned earlier that the time for a partition, when plotted
against $m$, leads to a line with positive slope and intercept. The lines
corresponding to the run times of equipartitions are of critical
importance in this discussion.


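Property 5 Consider the straight lines corresponding to the two
equipartitions $\mathcal{E}_{e,n}$ and $\mathcal{E}_{e,n-1}$, plotted
against $m$. Then

(1) $t_{d,\mathcal{E}_{e,n}}$ has greater slope than $t_{d,\mathcal{E}_{e,n-1}}$ and

(2) $t_{d,\mathcal{E}_{e,n}}$ has smaller intercept than $t_{d,\mathcal{E}_{e,n-1}}$.

Proof. We have from (7)

$$t_{d,\mathcal{E}_{e,n}} = \left[(n - 2^{-\mathcal{E}_{e,n}})\tau + n\rho\right] 2^d m
  + (2^{\mathcal{E}_{e,n}} - n)(\lambda + \delta)$$

$$t_{d,\mathcal{E}_{e,n-1}} = \left[(n - 1 - 2^{-\mathcal{E}_{e,n-1}})\tau + (n - 1)\rho\right] 2^d m
  + (2^{\mathcal{E}_{e,n-1}} - n + 1)(\lambda + \delta)$$

so that

$$\mathrm{slope}(t_{d,\mathcal{E}_{e,n}}) - \mathrm{slope}(t_{d,\mathcal{E}_{e,n-1}})
  = \left[(2^{-\mathcal{E}_{e,n-1}} - 2^{-\mathcal{E}_{e,n}} + 1)\tau + \rho\right] 2^d > 0
  \quad \text{by Property 4}$$

$$\mathrm{intercept}(t_{d,\mathcal{E}_{e,n}}) - \mathrm{intercept}(t_{d,\mathcal{E}_{e,n-1}})
  = (2^{\mathcal{E}_{e,n}} - 2^{\mathcal{E}_{e,n-1}} - 1)(\lambda + \delta) < 0
  \quad \text{by Property 3}$$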


The times taken by equipartitions thus form a hull in which the leftmost
face corresponds to a partition with maximum cardinality, while the
rightmost face corresponds to a partition of minimum cardinality. Faces
of decreasing cardinality lie between these extreme faces. Figure 1(a)
shows plots of the run times of all partitions (not necessarily
equipartitions) of $d = 4$ on a hypercube of dimension 4. We can see that
the hull of optimality is formed by equipartitions {1111}, {22} and {4}.
The non-equipartition {13} does not touch the hull. The equipartition
{112} touches the hull but does not contribute a face (it passes through
the point of intersection of {1111} and {22}). Figure 1(b) shows the
times for all partitions of $d = 6$ on a hypercube of dimension 6. In
this case the hull is formed by the equipartitions {111111}, {222}, {33}
and {6}. Only a few of the remaining partitions are labeled to avoid a
congested plot, but we can see that out of the 11 partitions of the
integer 6, only the abovementioned 4 equipartitions contribute a face to
the hull.

Figure 1: Run times for (a) $d = 4$ and (b) $d = 6$. In this particular
example, $\lambda = 100$, $\delta = 10$ (μsec.) and $\tau = 2$,
$\rho = 1$ (μsec./byte). Circle indicates the point of intersection of
all partitions of cardinality 2: {33}, {24} and {15}.
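The faces reported for Figure 1 can be reproduced by computing the lower envelope of the run-time lines of all partitions. The sketch below is a simple quadratic-time envelope walk over every partition, used here only as a cross-check; it is not the $O(\sqrt{d})$ method of the paper. It assumes the sample parameters of Figure 1 and uses exact rational arithmetic so that lines meeting at a common vertex are handled without rounding issues. All names are hypothetical.

```python
from fractions import Fraction


def partitions(d, smallest=1):
    """Yield the partitions of d as non-decreasing tuples."""
    if d == 0:
        yield ()
        return
    for first in range(smallest, d + 1):
        for rest in partitions(d - first, first):
            yield (first,) + rest


def line(d, part, tau, rho, lam, delta):
    """(slope, intercept) of the run time of `part` against m, from (6)."""
    n = len(part)
    neg = sum(Fraction(1, 2**a) for a in part)
    slope = ((n - neg) * tau + n * rho) * 2**d
    intercept = (sum(2**a for a in part) - n) * (lam + delta)
    return slope, intercept


def hull_faces(d, tau=2, rho=1, lam=100, delta=10):
    """Partitions of d that contribute a face, ordered from small to large m."""
    lines = {p: line(d, p, tau, rho, lam, delta) for p in partitions(d)}
    # At m = 0 the lowest line is the one with the smallest intercept.
    cur = min(lines, key=lambda p: (lines[p][1], lines[p][0]))
    faces = [cur]
    while True:
        s0, c0 = lines[cur]
        # Only flatter lines can overtake the current face. Take the earliest
        # crossing and, on ties, the flattest line, so that a line which only
        # passes through a vertex is skipped.
        nxt = min(((Fraction(c - c0) / (s0 - s), s, p)
                   for p, (s, c) in lines.items() if s < s0), default=None)
        if nxt is None:
            return faces
        cur = nxt[2]
        faces.append(cur)
```

Under these assumptions `hull_faces(4)` and `hull_faces(6)` return the equipartitions named in the text, with {112} skipped for $d = 4$ because it only touches the hull at a vertex.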

We  now  prove  Properties  6  and  7  which  are  also  illustrated  in  Figure  1. 

Property 6 For any $d$,

(a) $\mathcal{E}_{d,1}$ always lies on the hull, and

(b) $\mathcal{E}_{d,d}$ always lies on the hull.
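Proof. $\mathcal{A}_{d,n}$ represents an arbitrary partition of
cardinality $n$. From (6) and (7) we have

$$t_{d,\mathcal{A}_{d,n}} = \left[(n - 2^{-\mathcal{A}_{d,n}})\tau + n\rho\right] 2^d m
  + (2^{\mathcal{A}_{d,n}} - n)(\lambda + \delta)$$

$$t_{d,\mathcal{E}_{d,1}} = \left[(1 - 2^{-\mathcal{E}_{d,1}})\tau + \rho\right] 2^d m
  + (2^{\mathcal{E}_{d,1}} - 1)(\lambda + \delta)$$

$$t_{d,\mathcal{E}_{d,d}} = \left[(d - 2^{-\mathcal{E}_{d,d}})\tau + d\rho\right] 2^d m
  + (2^{\mathcal{E}_{d,d}} - d)(\lambda + \delta)$$

(a) The expression

$$\begin{aligned}
t_{d,\mathcal{A}_{d,n}} - t_{d,\mathcal{E}_{d,1}}
 &= \left[(n - 1 - 2^{-\mathcal{A}_{d,n}} + 2^{-\mathcal{E}_{d,1}})\tau + (n - 1)\rho\right] 2^d m
   + (2^{\mathcal{A}_{d,n}} - 2^{\mathcal{E}_{d,1}} - n + 1)(\lambda + \delta) \\
 &\ge \left[\left(\frac{n}{2} + 2^{-d} - 1\right)\tau + (n - 1)\rho\right] 2^d m
   + (2^{\mathcal{A}_{d,n}} - 2^{d} - n + 1)(\lambda + \delta)
   \quad \text{(by Property 1(b))}
\end{aligned}$$

is always positive for sufficiently large $m$ and $n > 1$ (for $n = 1$,
$\mathcal{A}_{d,n} = \mathcal{E}_{d,1}$ and the property holds
vacuously). Thus $\mathcal{E}_{d,1}$ lies below any $\mathcal{A}_{d,n}$
for sufficiently large $m$.

(b) At $m = 0$, the expression

$$t_{d,\mathcal{A}_{d,n}} - t_{d,\mathcal{E}_{d,d}}
  = (2^{\mathcal{A}_{d,n}} - 2^{\mathcal{E}_{d,d}} - n + d)(\lambda + \delta)$$

is greater than zero, since $2^{\mathcal{A}_{d,n}} \ge 2^{\mathcal{E}_{d,d}}$
(by Property 1(a)) and $d > n$. Thus $\mathcal{E}_{d,d}$ lies below
$\mathcal{A}_{d,n}$ for $m = 0$. ■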

The partition $\mathcal{E}_{d,1}$ corresponds to the Direct algorithm,
while $\mathcal{E}_{d,d}$ is equivalent to the Standard exchange. These
two algorithms are extreme cases of the Multiphase algorithm. Property 6
tells us that the Direct algorithm is always optimal for large values of
$m$, while Standard exchange is always best for very small values.

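Property 7 Of all partitions of cardinality 2, only $\mathcal{E}_{d,2}$
can lie on the hull.

Proof. Consider the two partitions $\mathcal{E}_{d,2} = \{e, d - e\}$ and
$\mathcal{A}_{d,2} = \{a, d - a\}$. We have

$$t_{d,\mathcal{E}_{d,2}} = \left[(2 - 2^{-\mathcal{E}_{d,2}})\tau + 2\rho\right] 2^d m
  + (2^{\mathcal{E}_{d,2}} - 2)(\lambda + \delta)$$

$$t_{d,\mathcal{A}_{d,2}} = \left[(2 - 2^{-\mathcal{A}_{d,2}})\tau + 2\rho\right] 2^d m
  + (2^{\mathcal{A}_{d,2}} - 2)(\lambda + \delta)$$

Solving $t_{d,\mathcal{E}_{d,2}} = t_{d,\mathcal{A}_{d,2}}$ for $m$ we
obtain

$$\begin{aligned}
m &= \frac{(2^{\mathcal{A}_{d,2}} - 2^{\mathcal{E}_{d,2}})(\lambda + \delta)}
          {(2^{-\mathcal{A}_{d,2}} - 2^{-\mathcal{E}_{d,2}})\,\tau\, 2^d} \\
  &= \frac{(2^{\mathcal{A}_{d,2}} - 2^{\mathcal{E}_{d,2}})(\lambda + \delta)}
          {(2^{-a} + 2^{-d+a} - 2^{-e} - 2^{-d+e})\,\tau\, 2^d} \\
  &= \frac{(2^{\mathcal{A}_{d,2}} - 2^{\mathcal{E}_{d,2}})(\lambda + \delta)}
          {(2^{d-a} + 2^{a} - 2^{d-e} - 2^{e})\,\tau} \\
  &= \frac{(2^{\mathcal{A}_{d,2}} - 2^{\mathcal{E}_{d,2}})(\lambda + \delta)}
          {(2^{\mathcal{A}_{d,2}} - 2^{\mathcal{E}_{d,2}})\,\tau} \\
  &= \frac{\lambda + \delta}{\tau}
\end{aligned}$$

which is independent of $\mathcal{E}_{d,2}$ and $\mathcal{A}_{d,2}$. Thus
all partitions of cardinality 2 intersect at a point.

Since $2^{-\mathcal{E}_{d,2}} \le 2^{-\mathcal{A}_{d,2}}$ and
$2^{\mathcal{E}_{d,2}} \le 2^{\mathcal{A}_{d,2}}$ (by Property 2),
$t_{d,\mathcal{E}_{d,2}}$ has greater slope and lesser intercept than
$t_{d,\mathcal{A}_{d,2}}$. Therefore only $\mathcal{E}_{d,2}$ can lie on
the hull for $m < (\lambda + \delta)/\tau$.

At $m = (\lambda + \delta)/\tau$ we have

$$\begin{aligned}
t_{d,\mathcal{E}_{d,2}} - t_{d,\mathcal{E}_{d,1}}
 &= \left[(1 - 2^{-\mathcal{E}_{d,2}} + 2^{-\mathcal{E}_{d,1}})\tau + \rho\right] 2^d\, \frac{\lambda + \delta}{\tau}
    + (2^{\mathcal{E}_{d,2}} - 2^{\mathcal{E}_{d,1}} - 1)(\lambda + \delta) \\
 &= \left[(1 - 2^{-e} - 2^{-(d-e)} + 2^{-d})\,2^d + (2^{e} + 2^{d-e} - 2^{d} - 1) + \rho 2^d/\tau\right](\lambda + \delta) \\
 &= \frac{\rho\, 2^d (\lambda + \delta)}{\tau}
\end{aligned}$$

which is always positive. This means that the line
$t_{d,\mathcal{E}_{d,1}}$ always passes below the common point of
intersection of all cardinality 2 partitions. We have already shown that
of all cardinality 2 partitions, only $\mathcal{E}_{d,2}$ can lie on the
hull below this point. Hence of all cardinality 2 partitions only
$\mathcal{E}_{d,2}$ can lie on the hull. ■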
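The common intersection point is easy to confirm with exact arithmetic. In the sketch below, `run_time` is a direct transcription of (6) and the other name is hypothetical; integer parameters are assumed so that `Fraction` can represent $m = (\lambda + \delta)/\tau$ exactly.

```python
from fractions import Fraction


def run_time(d, part, m, tau, rho, lam, delta):
    """Equation (6): run time of partition `part` on a dimension-d hypercube."""
    n = len(part)
    neg = sum(Fraction(1, 2**a) for a in part)
    pos = sum(2**a for a in part)
    return ((n - neg) * tau + n * rho) * 2**d * m + (pos - n) * (lam + delta)


def times_at_common_point(d, tau, rho, lam, delta):
    """Run times of every cardinality-2 partition at m = (lam + delta) / tau."""
    m = Fraction(lam + delta, tau)
    return {(a, d - a): run_time(d, (a, d - a), m, tau, rho, lam, delta)
            for a in range(1, d // 2 + 1)}
```

For $d = 6$ with the Figure 1 parameters all three cardinality-2 partitions give the same time at $m = 55$, and the Direct algorithm is faster there by $\rho 2^d (\lambda + \delta)/\tau$.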

In the following, we shall prove that a non-equipartition cannot
contribute a face to the hull of optimality and further that a large
number of equipartitions can at most touch the hull at a vertex.
Therefore, although there is an exponential number of partitions of an
integer $d$, we shall prove that the number of faces on the hull of
optimality is $O(\sqrt{d})$.


3.4  Main  Theorems 


The properties proved above permit us to determine the maximum number
of faces on the hull of optimality. Table 4 lists all partitions of the
integers 1..7.

Turning our attention to the partitions of 7, we see that if we select
all those partitions that have a ‘1’ in them (these are boxed in the
table) and then delete a ‘1’ from each of these, we obtain the partitions
of 6, which are given in the next column. Similarly, selecting all
partitions of 7 that have a ‘2’ in them and then deleting a ‘2’ from each
of these will result in the partitions of the integer 5 and so on. It is
thus clear that the set of all partitions of the integer $d$ is composed
of the union of the sets of all partitions of the integers $d - a$,
$1 \le a < d$, each augmented by $a$, and the integer $d$. For a specific
partition we have

$$\mathcal{A}_{k,m} = \mathcal{A}_{k-a,m-1} + \{a\}$$

where we take the ‘+’ operation to mean the addition of an element to a
partition. The following property is evident.

Property 8 $t_{d,\mathcal{A}_{k,m}} = t_{d,\mathcal{A}_{k-a,m-1}} + t_{d,\{a\}}$.

It follows that the straight lines describing the run times of all
partitions of an integer $d$ can be obtained by adding $t_{d,\{a\}}$ to
the run times of all partitions of the integer $d - a$, and then adding
the line $t_{d,\mathcal{E}_{d,1}}$. This permits us to prove the
following Theorem.

Theorem 1 A non-equipartition cannot touch the hull of optimality.


Proof.  By  induction  on  the  partitions  of  integers  <  d. 

Basis  step:  The  smallest  integer  that  has  a  non-equipartition  is  4,  which 
has  only  one:  {13}.  As  the  basis  step  of  our  induction,  we  shall  prove  that 
Id, {13}  can  never  touch  the  hull. 

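The equations for the 5 partitions of 4 are, from (1),

$$t_{d,\{4\}} = 15\lambda + 15\sigma + 2^d m \left(\rho + \tfrac{15}{16}\tau\right)$$

$$t_{d,\{13\}} = 8\lambda + 8\sigma + 2^d m \left(2\rho + \tfrac{11}{8}\tau\right)$$

$$t_{d,\{22\}} = 6\lambda + 6\sigma + 2^d m \left(2\rho + \tfrac{3}{2}\tau\right)$$

$$t_{d,\{112\}} = 5\lambda + 5\sigma + 2^d m \left(3\rho + \tfrac{7}{4}\tau\right)$$

$$t_{d,\{1111\}} = 4\lambda + 4\sigma + 2^d m \left(4\rho + 2\tau\right)$$

The point of intersection of $t_{d,\{4\}}$ and $t_{d,\{22\}}$ is

$$m = \frac{144(\lambda + \sigma)}{2^d (16\rho + 9\tau)}.$$

At this value of $m$ we have

$$t_{d,\{4\}} - t_{d,\{13\}} = -\frac{32\rho(\lambda + \sigma)}{16\rho + 9\tau}$$

which is always negative.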

The lines $t_{d,\{22\}}$, $t_{d,\{112\}}$ and $t_{d,\{1111\}}$ intersect at a single point which
occurs at

$$m = \frac{4(\lambda + \sigma)}{2^d (4\rho + \tau)}.$$

At this value of $m$ we have

$$t_{d,\{1111\}} - t_{d,\{13\}} = -\frac{(\lambda + \sigma)(16\rho + 3\tau)}{2(4\rho + \tau)}$$


which  is  also  always  negative.  Thus  the  partition  {13}  can  never  touch  the 
hull  of  optimality. 
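
The two differences in the basis step can be reproduced with exact arithmetic. The sketch below is illustrative, not the authors' code. It assumes the run time form $t_{d,\{a\}} = (2^a - 1)(\lambda + \sigma) + 2^d m\,(\rho + (1 - 2^{-a})\tau)$, inferred from the equations for $d = 4$, and the parameter values are arbitrary choices with $\rho > 0$.

```python
from fractions import Fraction

D = 4
# Illustrative parameter values (assumptions, not taken from the paper).
LAM, SIGMA, RHO, TAU = Fraction(100), Fraction(20), Fraction(1, 2), Fraction(1)

def line(partition):
    """(intercept, slope) of the assumed run time t(m) = intercept + slope*m."""
    intercept = sum((2**a - 1) * (LAM + SIGMA) for a in partition)
    slope = 2**D * sum(RHO + (1 - Fraction(1, 2**a)) * TAU for a in partition)
    return intercept, slope

def t(partition, m):
    c, s = line(partition)
    return c + s * m

def crossing(p, q):
    """Message size m at which the run times of p and q are equal."""
    (c1, s1), (c2, s2) = line(p), line(q)
    return (c2 - c1) / (s1 - s2)

m1 = crossing((4,), (2, 2))
gap1 = t((4,), m1) - t((1, 3), m1)
m2 = crossing((2, 2), (1, 1, 1, 1))
gap2 = t((1, 1, 1, 1), m2) - t((1, 3), m2)

print(m1, gap1)   # gap1 < 0: {13} lies above the vertex of {4} and {22}
print(m2, gap2)   # gap2 < 0: {13} lies above the point shared by {22}, {112}, {1111}
```

Both gaps are proportional to $\lambda + \sigma$, and the first vanishes when $\rho = 0$, so the strict inequality depends on a nonzero $\rho$.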

Induction: Suppose the theorem is true for all partitions of the integer
$k < d$. Partitions of the integer $k + 1$ can be obtained by adding $1, 2, \ldots, k$
to the partitions of the integers $k, k - 1, \ldots, 1$, and then adding $\mathcal{E}_{k+1,1} = \{k+1\}$
as discussed above. The corresponding run times are obtained by adding
$t_{d,1}, t_{d,2}, \ldots, t_{d,k}$ to run times of all the constituent partitions, as stated in
Property 8. Each time we add $t_{d,a}$ to all the partitions of a certain integer
we raise the hull of optimality and all other lines by a linear amount. The
resultant hull of optimality of cardinality $k + 1$ will be the intersection of the
hulls of cardinality $1, 2, \ldots, k$. A line that did not touch one of the constituent
hulls cannot touch the intersected hull.

When a partition is augmented, a new non-equipartition of cardinality $k$
can be created by augmenting (1) a non-equipartition of cardinality $j$ or (2)
an equipartition of cardinality $j$. In the first case our hypothesis continues
to hold since a non-equipartition not touching the hull is transformed into
a non-equipartition that still does not touch the hull.

The second case requires careful analysis. When an existing equipartition
of cardinality $j$, $\mathcal{E}_{d,j}$, that by hypothesis must touch the hull, is transformed
into a non-equipartition $\mathcal{E}_{d,j} + \{k - j\}$ we have two possibilities:

$k \ge 3$  Consider the partition obtained by deleting one of the original elements,
$m \in \mathcal{E}_{d,j}$, from $\mathcal{E}_{d,j} + \{k - j\}$. This new partition must be a
non-equipartition of cardinality $d - m$. In the hull for $d - m$ it could
not have touched the hull, being ‘masked’ by equipartitions of cardinality
$d - m$, and therefore it can now also not touch the hull after
augmentation.

$k = 2$  In this case Property 7 states that only the equipartition of cardinality
2 can lie on the hull.

Therefore no non-equipartition can touch the new augmented hull of optimality
for $k + 1$. We have proved that if the theorem is true for $k$ it is also
true for $k + 1$. We have already shown that it is true for $k = 2, 3, 4$. Thus it
is true for all $k$. ■
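
For small $d$ the statement of Theorem 1 can be checked by brute force over all partitions. The following sketch is illustrative, not the authors' method. It assumes the run time form $t_{d,\{a\}} = (2^a - 1)(\lambda + \sigma) + 2^d m\,(\rho + (1 - 2^{-a})\tau)$ inferred from the $d = 4$ equations, and uses arbitrary parameter values with $\rho > 0$. A partition is reported as touching the hull if its run time equals the minimum over all partitions at some $m \ge 0$.

```python
from fractions import Fraction

# Illustrative parameter values (assumptions); rho must be positive.
LAM, SIGMA, RHO, TAU = Fraction(100), Fraction(20), Fraction(1, 2), Fraction(1)

def partitions(n, largest=None):
    """Yield the partitions of n as non-increasing tuples."""
    largest = n if largest is None else largest
    if n == 0:
        yield ()
        return
    for a in range(min(n, largest), 0, -1):
        for rest in partitions(n - a, a):
            yield (a,) + rest

def line(partition, d):
    """(intercept, slope) of the assumed run time as a function of m."""
    intercept = sum((2**a - 1) * (LAM + SIGMA) for a in partition)
    slope = 2**d * sum(RHO + (1 - Fraction(1, 2**a)) * TAU for a in partition)
    return intercept, slope

def touching(d):
    """Partitions of d whose run time is minimal for at least one m >= 0."""
    lines = {p: line(p, d) for p in partitions(d)}
    vals = list(lines.values())
    # The lower envelope only changes at pairwise crossings, so it suffices
    # to test m = 0, every positive crossing, and one point beyond the last.
    ms = {Fraction(0)}
    for i, (c1, s1) in enumerate(vals):
        for c2, s2 in vals[i + 1:]:
            if s1 != s2 and (c2 - c1) / (s1 - s2) > 0:
                ms.add((c2 - c1) / (s1 - s2))
    ms.add(max(ms) + 1)
    result = set()
    for m in ms:
        best = min(c + s * m for c, s in vals)
        result |= {p for p, (c, s) in lines.items() if c + s * m == best}
    return result

def is_equipartition(p):
    return max(p) - min(p) <= 1

for d in range(2, 7):
    print(d, sorted(touching(d)), all(is_equipartition(p) for p in touching(d)))
```

The enumeration is exponential in $d$ and is only meant as a sanity check for small cubes.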

An  important  consequence  of  Theorem  1  is  the  fact  that  even  though 
there  is  an  exponential  number  of  partitions  of  d,  the  total  number  of  faces 
on  the  hull  of  optimality  cannot  exceed  d,  the  number  of  equipartitions.  We 
shall  continue  with  further  investigations  into  the  properties  of  equipartitions. 
These will permit us to improve the bound on the number of faces to $O(\sqrt{d})$.
At  this  point  we  prove  a  theorem  that  shall  permit  us  to  place  a  lower  bound 
on  the  number  of  faces  on  the  hull. 

Theorem  2  Every  equipartition  must  touch  the  hull  of  optimality. 

Proof.  By  induction  on  d. 

Basis step: The theorem is true for $d = 2$, since by Property 6, both {11}
and  {2}  must  lie  on  the  hull. 

Induction: Assume the theorem is true for $d = n$. Then the hull of optimality
is touched by all equipartitions $\mathcal{E}_{n,i}$, $1 \le i \le n$. The set of equipartitions
of the integer $n + 1$, that is $\mathcal{E}_{n+1,i}$, can be formed from the set of equipartitions
of the integer $n$, $\mathcal{E}_{n,i}$, by adding 1 to the smallest element of each and
then adding the new equipartition $\mathcal{E}_{n+1,n+1}$.

Turning to the corresponding run times, this operation is equivalent to
adding

$$t_{d,1} = \left(\tfrac{\tau}{2} + \rho\right) 2^d m + (\lambda + \sigma)$$

to each of $t_{\mathcal{E}_{n,i}}$, $1 \le i \le n$ (see equation (1)). Since the same linear expression
is added to each $t_{\mathcal{E}_{n,i}}$, the relationships between these lines are undisturbed and
the augmented hull is touched by all of the augmented equipartitions. Now
consider $t_{\mathcal{E}_{n+1,n+1}}$; this must touch the hull because of Property 6(b). Thus
the hull of optimality of the integer $n + 1$ is touched by all equipartitions of
$n + 1$.

We  have  proved  the  theorem  to  be  true  for  d  =  3.  We  have  shown  that  if 
it  is  true  for  d  =  n  it  is  also  true  for  d  =  n  +  1.  It  is  therefore  true  for  all  d.  ■ 
An equipartition $\mathcal{E}_{d,n}$ can only have two distinct elements: $\lfloor d/n \rfloor$ and
$\lceil d/n \rceil$. In some cases sequences of several different equipartitions have the
same two distinct elements. For example, in Table 4, $\mathcal{E}_{6,6} = [6,1]$, $\mathcal{E}_{6,5} =
[4,1][1,2]$, $\mathcal{E}_{6,4} = [2,1][2,2]$ and $\mathcal{E}_{6,3} = [3,2]$. All these partitions are composed
of 1’s and 2’s exclusively. Similarly, the following equipartitions of the
integer 19,

$\mathcal{E}_{19,7} = \{2233333\}$

$\mathcal{E}_{19,8} = \{22222333\}$

$\mathcal{E}_{19,9} = \{222222223\}$

are all composed of 2’s and 3’s exclusively. We call such equipartitions indistinct.
It is clear that indistinct partitions always have successive cardinality
values.
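
As an illustration, equipartitions and runs of indistinct equipartitions can be generated directly from $d$. This is a small Python sketch with hypothetical function names. It uses the fact that an equipartition of cardinality $n$ consists only of $w$ and $w + 1$ exactly when $w \le d/n \le w + 1$.

```python
def equipartition(d, n):
    """Equipartition of d with cardinality n, as a non-decreasing tuple."""
    q, r = divmod(d, n)
    return (q,) * (n - r) + (q + 1,) * r

def indistinct_cardinalities(d, w):
    """Cardinalities n whose equipartition of d uses only w and w + 1."""
    first = -(-d // (w + 1))   # ceil(d / (w + 1))
    last = d // w              # floor(d / w)
    return list(range(first, last + 1))

for n in indistinct_cardinalities(19, 2):
    print(n, equipartition(19, n))
# 7 (2, 2, 3, 3, 3, 3, 3)
# 8 (2, 2, 2, 2, 2, 3, 3, 3)
# 9 (2, 2, 2, 2, 2, 2, 2, 2, 3)
```

For $d = 6$ and $w = 1$ the same function returns the cardinalities 3 to 6, matching the example from Table 4.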

Theorem  3  The  run  time  functions  of  indistinct  equipartitions  are  linearly 
dependent. 

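Proof. Consider three indistinct equipartitions of cardinality $p$, $p + 1$ and
$p - 1$ that are composed of the elements $\Omega$ and $\Omega + 1$. Then for some $\alpha, \beta, \gamma$

$$\mathcal{E}_{d,p} = [\alpha, \Omega]\,[p - \alpha, \Omega + 1]$$

$$\mathcal{E}_{d,p+1} = [\beta, \Omega]\,[p + 1 - \beta, \Omega + 1]$$

$$\mathcal{E}_{d,p-1} = [\gamma, \Omega]\,[p - 1 - \gamma, \Omega + 1]$$

The times for these equipartitions are

$$t_{\mathcal{E}_{d,p}} = \alpha\, t_{d,\Omega} + (p - \alpha)\, t_{d,\Omega+1} \tag{8}$$

$$t_{\mathcal{E}_{d,p+1}} = \beta\, t_{d,\Omega} + (p + 1 - \beta)\, t_{d,\Omega+1} \tag{9}$$

$$t_{\mathcal{E}_{d,p-1}} = \gamma\, t_{d,\Omega} + (p - 1 - \gamma)\, t_{d,\Omega+1} \tag{10}$$

Since we are dealing with equipartitions of the integer $d$,

$$d = \alpha\Omega + (p - \alpha)(\Omega + 1)$$

$$\phantom{d} = \beta\Omega + (p + 1 - \beta)(\Omega + 1)$$

$$\phantom{d} = \gamma\Omega + (p - 1 - \gamma)(\Omega + 1)$$

These yield the following relations

$$\beta = 1 + \alpha + \Omega \tag{11}$$

$$\gamma = -1 + \alpha - \Omega \tag{12}$$

Substituting (11) and (12) in (9) and (10) we obtain the system

$$t_{\mathcal{E}_{d,p}} = \alpha\, t_{d,\Omega} + (p - \alpha)\, t_{d,\Omega+1} \tag{13}$$

$$t_{\mathcal{E}_{d,p+1}} = (1 + \alpha + \Omega)\, t_{d,\Omega} + (p - \alpha - \Omega)\, t_{d,\Omega+1} \tag{14}$$

$$t_{\mathcal{E}_{d,p-1}} = (-1 + \alpha - \Omega)\, t_{d,\Omega} + (p - \alpha + \Omega)\, t_{d,\Omega+1} \tag{15}$$

Adding (14) and (15) we obtain

$$t_{\mathcal{E}_{d,p+1}} + t_{\mathcal{E}_{d,p-1}} = 2\alpha\, t_{d,\Omega} + 2(p - \alpha)\, t_{d,\Omega+1}$$

Hence the system is linearly dependent. ■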
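
The linear dependence can be observed numerically for the indistinct equipartitions of 19 with cardinalities 7, 8 and 9. The sketch below is illustrative only. It assumes the run time form $t_{d,\{a\}} = (2^a - 1)(\lambda + \sigma) + 2^d m\,(\rho + (1 - 2^{-a})\tau)$ inferred from the $d = 4$ equations, with arbitrary parameter values.

```python
from fractions import Fraction

# Illustrative parameter values (assumptions, not taken from the paper).
LAM, SIGMA, RHO, TAU = Fraction(100), Fraction(20), Fraction(1, 2), Fraction(1)

def equipartition(d, n):
    q, r = divmod(d, n)
    return (q,) * (n - r) + (q + 1,) * r

def line(partition, d):
    """(intercept, slope) of the assumed run time as a function of m."""
    intercept = sum((2**a - 1) * (LAM + SIGMA) for a in partition)
    slope = 2**d * sum(RHO + (1 - Fraction(1, 2**a)) * TAU for a in partition)
    return intercept, slope

d, p = 19, 8
below, mid, above = (line(equipartition(d, n), d) for n in (p - 1, p, p + 1))

# t_{p+1} + t_{p-1} = 2 t_p holds separately for intercepts and slopes.
print(below[0] + above[0] == 2 * mid[0], below[1] + above[1] == 2 * mid[1])

# Consequently the three lines pass through one common point m_star.
m_star = (mid[0] - above[0]) / (above[1] - mid[1])
values = [c + s * m_star for c, s in (below, mid, above)]
print(m_star, values[0] == values[1] == values[2])
```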

Theorem  3  assures  us  that  all  members  of  a  set  of  indistinct  equipartitions 
intersect  at  a  single  point.  Therefore  only  two  of  these  can  contribute  faces 
to  the  hull  of  optimality,  since  they  have  successively  decreasing  slopes  and 
increasing  intercepts  (Property  5).  For  example,  in  the  hull  for  d  =  4  (Figure 
1),  we  can  see  that  the  equipartitions  {1111},  {112}  and  {22}  intersect  at  a 
point  and  only  {1111}  and  {22}  contribute  faces  to  the  hull. 

3.5  Faces  on  the  Hull 

From the foregoing discussion we can see that all equipartitions touch the
hull. Each distinct equipartition contributes a single face to the hull while
each set of indistinct equipartitions contributes two faces. To find a bound
on the number of faces on the hull, refer to Figure 2 which plots $\lfloor d/n \rfloor$
and $\lceil d/n \rceil$ versus $n$ for $d = 11$ and 16. In each of these plots, the dashed
curve represents the continuous function $d/n$. The values of $\lfloor d/n \rfloor$ and
$\lceil d/n \rceil$ are indicated by heavy dots. When $\lfloor d/n \rfloor < \lceil d/n \rceil$, there is a vertical
line joining these dots. In the plot for $d = 11$ we have enumerated all the
equipartitions in full, while in the plot for $d = 16$ we have used the compact
notation (3). The lines marked with ‘+’s are tangents, with slope $-1$, at the
point $(\lfloor\sqrt{d}\rfloor, \lfloor\sqrt{d}\rfloor)$.

Over the range $1 \le n \le \sqrt{d}$ the slope of these hyperbolas is less than
$-1$ and therefore no two consecutive equipartitions can have an element in
common. All equipartitions in this range are distinct and their number is
equal to the number of integers in this range, which is $\lfloor\sqrt{d}\rfloor$. This equals the
number of ‘+’s on the tangent between $n = 1$ and $n = \lfloor\sqrt{d}\rfloor$.
Figure 2: Plots of $\lfloor d/n \rfloor$, $\lceil d/n \rceil$ for $d = 11$ and 16.

Indistinct equipartitions can only occur over the range $\sqrt{d} < n \le d$.
The number of sets of indistinct equipartitions is no more than the number of
distinct values of $\lfloor d/n \rfloor$, which is the number of ‘+’s on the tangent between
$\sqrt{d}$ and $d$, and is again $\lfloor\sqrt{d}\rfloor$.

In the range $1 \le n \le \sqrt{d}$, there are no indistinct equipartitions, so one
face is contributed to the hull by each equipartition, giving us a total of $\lfloor\sqrt{d}\rfloor$
faces. In the range $\sqrt{d} < n \le d$ there may be up to $\lfloor\sqrt{d}\rfloor$ sets of indistinct
equipartitions, each contributing at most 2 faces and at least one face to the
hull. An upper bound on the total number of faces on the hull is therefore
$3\lfloor\sqrt{d}\rfloor$. To obtain a lower bound note that the hyperbola is symmetric about
the line $n = d/n$ (the line through the origin with slope 1). If the point
lies on this line the number of distinct levels is $2\lfloor\sqrt{d}\rfloor - 1$ and
is $2\lfloor\sqrt{d}\rfloor$ otherwise. Each level must contribute at least one face to the hull.
Thus the lower bound on the number of faces is $2\lfloor\sqrt{d}\rfloor - 1$.
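
The bounds can be compared with a direct count of faces for small cubes. The sketch below is illustrative, not the authors' procedure. It builds the lower envelope of the $d$ equipartition lines with a stack and counts a line as a face only if it is strictly minimal over an interval of $m$. It assumes the run time form $t_{d,\{a\}} = (2^a - 1)(\lambda + \sigma) + 2^d m\,(\rho + (1 - 2^{-a})\tau)$ inferred from the $d = 4$ equations, with arbitrary parameter values.

```python
from fractions import Fraction
from math import isqrt

# Illustrative parameter values (assumptions, not taken from the paper).
LAM, SIGMA, RHO, TAU = Fraction(100), Fraction(20), Fraction(1, 2), Fraction(1)

def equipartition(d, n):
    q, r = divmod(d, n)
    return (q,) * (n - r) + (q + 1,) * r

def line(partition, d):
    """(intercept, slope) of the assumed run time as a function of m."""
    intercept = sum((2**a - 1) * (LAM + SIGMA) for a in partition)
    slope = 2**d * sum(RHO + (1 - Fraction(1, 2**a)) * TAU for a in partition)
    return intercept, slope

def hull_faces(d):
    """Cardinalities of the equipartitions owning a face, from small to large m."""
    stack = []                          # entries: (cardinality, intercept, slope)
    for n in range(d, 0, -1):           # slopes decrease, intercepts increase
        c, s = line(equipartition(d, n), d)
        while len(stack) >= 2:
            (_, c0, s0), (_, c1, s1) = stack[-2], stack[-1]
            # Drop the middle line if the new line overtakes stack[-2] no later
            # than the middle line did; a tie means all three share one point.
            if (c - c0) / (s0 - s) <= (c1 - c0) / (s0 - s1):
                stack.pop()
            else:
                break
        stack.append((n, c, s))
    return [n for n, _, _ in stack]

for d in (4, 5, 6, 16):
    faces = hull_faces(d)
    print(d, faces, 2 * isqrt(d) - 1, len(faces), 3 * isqrt(d))
```

For $d = 4$ the faces have cardinalities 4, 2 and 1, that is {1111}, {22} and {4}, in agreement with Figure 1.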

The equipartitions that contribute to the hull can be found by visiting
all $O(\sqrt{d})$ points on the tangent. For $1 \le n \le \lfloor\sqrt{d}\rfloor$, each point corresponds
to the equipartition $\mathcal{E}_{d,n}$. For each element value $n$ in the range $1, \ldots, \lfloor\sqrt{d}\rfloor$ there is a
sequence of indistinct equipartitions extending from $\lceil d/(n + 1) \rceil$ to $\lfloor d/n \rfloor$.
We need consider only the first and last members of these sequences. Thus
all partitions contributing to the hull can be found in $\Theta(\sqrt{d})$ time. Once
these partitions have been found, the vertices of the hull can be discovered
by computing the intersection points of adjacent partitions, again in $O(\sqrt{d})$
time. The intersection points will be computed in order and, once they have
been stored, the optimal algorithm for any value of $m$ can be found using a
binary search in $O(\log d)$ time.
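
The procedure just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the run time form $t_{d,\{a\}} = (2^a - 1)(\lambda + \sigma) + 2^d m\,(\rho + (1 - 2^{-a})\tau)$ is inferred from the $d = 4$ equations, the parameter values are arbitrary, and all function names are hypothetical. Candidates are the cardinalities $1, \ldots, \lfloor\sqrt{d}\rfloor$ plus the first and last member of each indistinct sequence.

```python
import bisect
from fractions import Fraction
from math import isqrt

# Illustrative parameter values (assumptions, not taken from the paper).
LAM, SIGMA, RHO, TAU = Fraction(100), Fraction(20), Fraction(1, 2), Fraction(1)

def equipartition(d, n):
    q, r = divmod(d, n)
    return (q,) * (n - r) + (q + 1,) * r

def line(partition, d):
    """(intercept, slope) of the assumed run time as a function of m."""
    intercept = sum((2**a - 1) * (LAM + SIGMA) for a in partition)
    slope = 2**d * sum(RHO + (1 - Fraction(1, 2**a)) * TAU for a in partition)
    return intercept, slope

def candidates(d):
    """O(sqrt(d)) cardinalities that can own a face, largest first."""
    r = isqrt(d)
    ns = set(range(1, r + 1))            # distinct equipartitions
    for w in range(1, r + 1):            # sequences built from w and w + 1
        ns.add(-(-d // (w + 1)))         # first member: ceil(d / (w + 1))
        ns.add(d // w)                   # last member: floor(d / w)
    return sorted(ns, reverse=True)

def build_hull(d):
    """Return (cardinalities on the hull, sorted vertex positions in m)."""
    hull = []
    for n in candidates(d):              # slopes decrease along this order
        c, s = line(equipartition(d, n), d)
        while len(hull) >= 2:
            (_, c0, s0), (_, c1, s1) = hull[-2], hull[-1]
            if (c - c0) / (s0 - s) <= (c1 - c0) / (s0 - s1):
                hull.pop()               # middle line never strictly minimal
            else:
                break
        hull.append((n, c, s))
    breaks = [(c1 - c0) / (s0 - s1)
              for (_, c0, s0), (_, c1, s1) in zip(hull, hull[1:])]
    return [n for n, _, _ in hull], breaks

def optimal_cardinality(m, cards, breaks):
    """Binary search for the face that is minimal at message size m."""
    return cards[bisect.bisect_left(breaks, m)]

cards, breaks = build_hull(6)
for m in (1, 5, 50, 500):
    n = optimal_cardinality(m, cards, breaks)
    print(m, equipartition(6, n))
```

The hull is built once per set of machine parameters, and each later query costs one binary search over at most $3\lfloor\sqrt{d}\rfloor$ stored vertices.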

4  Conclusions 

We have analyzed the multiphase complete exchange algorithm and shown
that the total number of optimal algorithms lies between $2\sqrt{d} - 1$ and $3\sqrt{d}$.
This holds under the assumption that the time for transmitting a message is
independent of the number of communication links traversed. High performance
parallel machines satisfy this assumption.

In addition to its theoretical interest, this result is of considerable practical
importance. It allows us to compute the optimal algorithm for any
given values of hypercube performance parameters and message length very
quickly. When dealing with an application where the performance parameters
(Table 1) are fixed and the message lengths for complete exchange vary
from time to time, the values of message length $m$ at which vertices of the
hull of optimality occur can be computed ahead of time and stored in a sorted
list. During the course of program execution, a fast binary search will locate
the optimal algorithm for the current message size.

When  the  performance  parameters  vary  with  time,  as  would  happen  if 
the  communication  network  were  shared  among  several  subcubes,  our  results 
provide  a  fast  method  for  computing  the  optimal  algorithm  from  scratch. 
A  related  situation  is  where  the  same  application  is  run  on  hypercubes  of 
different  sizes. 
Among the future directions of this research, the foremost issue is an extension
to 2 and 3 dimensional meshes. Preliminary results on 2-dimensional
meshes appear in [5]. Since the time required for ‘direct’ complete exchanges
on $N$-processor 2 and 3-d meshes is $O(N^{3/2})$ and $O(N^{4/3})$ respectively [15],
compared to the hypercube’s $O(N)$, any improvements will be especially
welcome.